In [6]:
from google.colab import files
uploaded = files.upload()
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving energy_dataset.csv to energy_dataset.csv
In [7]:
#import the pandas libraries and assign as pd
import pandas as pd
#read the csv data file called"energy_dataset.csv"
energy_df=pd.read_csv("energy_dataset.csv")
# Display the contents of the DataFrame "energy_df"
energy_df
Out[7]:
time generation biomass generation fossil brown coal/lignite generation fossil coal-derived gas generation fossil gas generation fossil hard coal generation fossil oil generation fossil oil shale generation fossil peat generation geothermal ... generation waste generation wind offshore generation wind onshore forecast solar day ahead forecast wind offshore eday ahead forecast wind onshore day ahead total load forecast total load actual price day ahead price actual
0 2015-01-01 00:00:00+01:00 447.0 329.0 0.0 4844.0 4821.0 162.0 0.0 0.0 0.0 ... 196.0 0.0 6378.0 17.0 NaN 6436.0 26118.0 25385.0 50.10 65.41
1 2015-01-01 01:00:00+01:00 449.0 328.0 0.0 5196.0 4755.0 158.0 0.0 0.0 0.0 ... 195.0 0.0 5890.0 16.0 NaN 5856.0 24934.0 24382.0 48.10 64.92
2 2015-01-01 02:00:00+01:00 448.0 323.0 0.0 4857.0 4581.0 157.0 0.0 0.0 0.0 ... 196.0 0.0 5461.0 8.0 NaN 5454.0 23515.0 22734.0 47.33 64.48
3 2015-01-01 03:00:00+01:00 438.0 254.0 0.0 4314.0 4131.0 160.0 0.0 0.0 0.0 ... 191.0 0.0 5238.0 2.0 NaN 5151.0 22642.0 21286.0 42.27 59.32
4 2015-01-01 04:00:00+01:00 428.0 187.0 0.0 4130.0 3840.0 156.0 0.0 0.0 0.0 ... 189.0 0.0 4935.0 9.0 NaN 4861.0 21785.0 20264.0 38.41 56.04
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
35059 2018-12-31 19:00:00+01:00 297.0 0.0 0.0 7634.0 2628.0 178.0 0.0 0.0 0.0 ... 277.0 0.0 3113.0 96.0 NaN 3253.0 30619.0 30653.0 68.85 77.02
35060 2018-12-31 20:00:00+01:00 296.0 0.0 0.0 7241.0 2566.0 174.0 0.0 0.0 0.0 ... 280.0 0.0 3288.0 51.0 NaN 3353.0 29932.0 29735.0 68.40 76.16
35061 2018-12-31 21:00:00+01:00 292.0 0.0 0.0 7025.0 2422.0 168.0 0.0 0.0 0.0 ... 286.0 0.0 3503.0 36.0 NaN 3404.0 27903.0 28071.0 66.88 74.30
35062 2018-12-31 22:00:00+01:00 293.0 0.0 0.0 6562.0 2293.0 163.0 0.0 0.0 0.0 ... 287.0 0.0 3586.0 29.0 NaN 3273.0 25450.0 25801.0 63.93 69.89
35063 2018-12-31 23:00:00+01:00 290.0 0.0 0.0 6926.0 2166.0 163.0 0.0 0.0 0.0 ... 287.0 0.0 3651.0 26.0 NaN 3117.0 24424.0 24455.0 64.27 69.88

35064 rows × 29 columns

In [8]:
#creating the energy_loss column to the dataframe "energy_df"
# This line calculates the difference between the "total load forecast" column and the "total load actual" column

energy_df["energy_loss"]=energy_df["total load forecast"]-energy_df["total load actual"]
In [9]:
# Assigns the calculated series of energy loss values to a new column named "energy_loss" in the DataFrame energy_df

energy_df["energy_loss"]
Out[9]:
0         733.0
1         552.0
2         781.0
3        1356.0
4        1521.0
          ...  
35059     -34.0
35060     197.0
35061    -168.0
35062    -351.0
35063     -31.0
Name: energy_loss, Length: 35064, dtype: float64
In [10]:
#Display the list of columns names

energy_df.columns
Out[10]:
Index(['time', 'generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil coal-derived gas', 'generation fossil gas',
       'generation fossil hard coal', 'generation fossil oil',
       'generation fossil oil shale', 'generation fossil peat',
       'generation geothermal', 'generation hydro pumped storage aggregated',
       'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation marine',
       'generation nuclear', 'generation other', 'generation other renewable',
       'generation solar', 'generation waste', 'generation wind offshore',
       'generation wind onshore', 'forecast solar day ahead',
       'forecast wind offshore eday ahead', 'forecast wind onshore day ahead',
       'total load forecast', 'total load actual', 'price day ahead',
       'price actual', 'energy_loss'],
      dtype='object')
In [15]:
from google.colab import files
uploaded = files.upload()
Upload widget is only available when the cell has been executed in the current browser session. Please rerun this cell to enable.
Saving weather_features.csv to weather_features.csv
In [16]:
# Use the pandas function "read_csv()" to read the contents of the CSV file "weather_features.csv"
# The function returns a DataFrame containing the data from the CSV file

weather_df=pd.read_csv("weather_features.csv")
In [17]:
#Assigns the result of the merge operation to a new DataFrame named final.

final=weather_df.merge(energy_df,how="inner",left_on="dt_iso",right_on="time")
In [18]:
# display a summary of information about the DataFrame's structure and contents.

final.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 178396 entries, 0 to 178395
Data columns (total 47 columns):
 #   Column                                       Non-Null Count   Dtype  
---  ------                                       --------------   -----  
 0   dt_iso                                       178396 non-null  object 
 1   city_name                                    178396 non-null  object 
 2   temp                                         178396 non-null  float64
 3   temp_min                                     178396 non-null  float64
 4   temp_max                                     178396 non-null  float64
 5   pressure                                     178396 non-null  int64  
 6   humidity                                     178396 non-null  int64  
 7   wind_speed                                   178396 non-null  int64  
 8   wind_deg                                     178396 non-null  int64  
 9   rain_1h                                      178396 non-null  float64
 10  rain_3h                                      178396 non-null  float64
 11  snow_3h                                      178396 non-null  float64
 12  clouds_all                                   178396 non-null  int64  
 13  weather_id                                   178396 non-null  int64  
 14  weather_main                                 178396 non-null  object 
 15  weather_description                          178396 non-null  object 
 16  weather_icon                                 178396 non-null  object 
 17  time                                         178396 non-null  object 
 18  generation biomass                           178301 non-null  float64
 19  generation fossil brown coal/lignite         178306 non-null  float64
 20  generation fossil coal-derived gas           178306 non-null  float64
 21  generation fossil gas                        178306 non-null  float64
 22  generation fossil hard coal                  178306 non-null  float64
 23  generation fossil oil                        178301 non-null  float64
 24  generation fossil oil shale                  178306 non-null  float64
 25  generation fossil peat                       178306 non-null  float64
 26  generation geothermal                        178306 non-null  float64
 27  generation hydro pumped storage aggregated   0 non-null       float64
 28  generation hydro pumped storage consumption  178301 non-null  float64
 29  generation hydro run-of-river and poundage   178301 non-null  float64
 30  generation hydro water reservoir             178306 non-null  float64
 31  generation marine                            178301 non-null  float64
 32  generation nuclear                           178311 non-null  float64
 33  generation other                             178306 non-null  float64
 34  generation other renewable                   178306 non-null  float64
 35  generation solar                             178306 non-null  float64
 36  generation waste                             178301 non-null  float64
 37  generation wind offshore                     178306 non-null  float64
 38  generation wind onshore                      178306 non-null  float64
 39  forecast solar day ahead                     178396 non-null  float64
 40  forecast wind offshore eday ahead            0 non-null       float64
 41  forecast wind onshore day ahead              178396 non-null  float64
 42  total load forecast                          178396 non-null  float64
 43  total load actual                            178216 non-null  float64
 44  price day ahead                              178396 non-null  float64
 45  price actual                                 178396 non-null  float64
 46  energy_loss                                  178216 non-null  float64
dtypes: float64(35), int64(6), object(6)
memory usage: 65.3+ MB
In [19]:
#display a summary of information about the DataFrame's structure and contents.

weather_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178396 entries, 0 to 178395
Data columns (total 17 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   dt_iso               178396 non-null  object 
 1   city_name            178396 non-null  object 
 2   temp                 178396 non-null  float64
 3   temp_min             178396 non-null  float64
 4   temp_max             178396 non-null  float64
 5   pressure             178396 non-null  int64  
 6   humidity             178396 non-null  int64  
 7   wind_speed           178396 non-null  int64  
 8   wind_deg             178396 non-null  int64  
 9   rain_1h              178396 non-null  float64
 10  rain_3h              178396 non-null  float64
 11  snow_3h              178396 non-null  float64
 12  clouds_all           178396 non-null  int64  
 13  weather_id           178396 non-null  int64  
 14  weather_main         178396 non-null  object 
 15  weather_description  178396 non-null  object 
 16  weather_icon         178396 non-null  object 
dtypes: float64(6), int64(6), object(5)
memory usage: 23.1+ MB
In [20]:
#count of non-null values in the corresponding column of the DataFrame.
energy_df.count()
Out[20]:
time                                           35064
generation biomass                             35045
generation fossil brown coal/lignite           35046
generation fossil coal-derived gas             35046
generation fossil gas                          35046
generation fossil hard coal                    35046
generation fossil oil                          35045
generation fossil oil shale                    35046
generation fossil peat                         35046
generation geothermal                          35046
generation hydro pumped storage aggregated         0
generation hydro pumped storage consumption    35045
generation hydro run-of-river and poundage     35045
generation hydro water reservoir               35046
generation marine                              35045
generation nuclear                             35047
generation other                               35046
generation other renewable                     35046
generation solar                               35046
generation waste                               35045
generation wind offshore                       35046
generation wind onshore                        35046
forecast solar day ahead                       35064
forecast wind offshore eday ahead                  0
forecast wind onshore day ahead                35064
total load forecast                            35064
total load actual                              35028
price day ahead                                35064
price actual                                   35064
energy_loss                                    35028
dtype: int64
In [21]:
import plotly.express as px
time_line_dict= [{
    "Task":"Defining objectives",
    "Start":"2023-7-3",
    "End":"2023-7-4"
},
{"Task":"Data Collection",
"Start":"2023-7-5",
"End":"2023-7-7"},
    {"Task":"Data Exploration",
"Start":"2023-7-5",
"End":"2023-7-7"},
    {"Task":"Feature Selection and Engineering",
"Start":"2023-7-11",
"End":"2023-7-15"},
    {"Task":"Model Selection and Developement",
"Start":"2023-7-15",
"End":"2023-7-20"},{"Task":"Model Evaluation and Refinement",
"Start":"2023-7-20",
"End":"2023-7-27"},
    {"Task":"Interpretation and presentation of results",
"Start":"2023-7-28",
"End":"2023-8-3"},
    {"Task":"deployment and monitoring",
"Start":"2023-8-4",
"End":"2023-8-13"}
]
fig=px.timeline(pd.DataFrame(time_line_dict), x_start="Start", x_end="End", y="Task")
fig.update_yaxes(autorange="reversed") # otherwise tasks are listed from the bottom up
fig.show()
In [22]:
#drop the specified columns from the DataFrame "final"
final.drop(columns=["forecast wind offshore eday ahead","generation hydro pumped storage aggregated"],axis=1,inplace=True)
In [23]:
# List of columns to be summed from the DataFrame
#calculating the total energy generation across different sources.
columns_to_sum=['generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil coal-derived gas', 'generation fossil gas',
       'generation fossil hard coal', 'generation fossil oil',
       'generation fossil oil shale', 'generation fossil peat',
       'generation geothermal',
       'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation marine',
       'generation nuclear', 'generation other', 'generation other renewable',
       'generation solar', 'generation waste', 'generation wind offshore',
       'generation wind onshore']
In [24]:
# Display data types of columns in the DataFrame "energy_df"
energy_df.dtypes
Out[24]:
time                                            object
generation biomass                             float64
generation fossil brown coal/lignite           float64
generation fossil coal-derived gas             float64
generation fossil gas                          float64
generation fossil hard coal                    float64
generation fossil oil                          float64
generation fossil oil shale                    float64
generation fossil peat                         float64
generation geothermal                          float64
generation hydro pumped storage aggregated     float64
generation hydro pumped storage consumption    float64
generation hydro run-of-river and poundage     float64
generation hydro water reservoir               float64
generation marine                              float64
generation nuclear                             float64
generation other                               float64
generation other renewable                     float64
generation solar                               float64
generation waste                               float64
generation wind offshore                       float64
generation wind onshore                        float64
forecast solar day ahead                       float64
forecast wind offshore eday ahead              float64
forecast wind onshore day ahead                float64
total load forecast                            float64
total load actual                              float64
price day ahead                                float64
price actual                                   float64
energy_loss                                    float64
dtype: object
In [25]:
# Import the "pyplot" submodule from the "matplotlib" library, which provides plotting functions

import matplotlib.pyplot as plt

# the "plot()" function from the "pyplot" submodule to create a line plot
# The x-axis data is taken from the "time" column of the first 1000 rows of the DataFrame "energy_df"
# The y-axis data is taken from the "energy_loss" column of the same 1000 rows
plt.plot(energy_df[:1000]["time"],energy_df[:1000]["energy_loss"])
plt.show()
In [26]:
#uses the corr() method on the DataFrame 'final' to compute the correlation matrix.

final.corr().columns
<ipython-input-26-fce803a60abf>:3: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

Out[26]:
Index(['temp', 'temp_min', 'temp_max', 'pressure', 'humidity', 'wind_speed',
       'wind_deg', 'rain_1h', 'rain_3h', 'snow_3h', 'clouds_all', 'weather_id',
       'generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil coal-derived gas', 'generation fossil gas',
       'generation fossil hard coal', 'generation fossil oil',
       'generation fossil oil shale', 'generation fossil peat',
       'generation geothermal', 'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation marine',
       'generation nuclear', 'generation other', 'generation other renewable',
       'generation solar', 'generation waste', 'generation wind offshore',
       'generation wind onshore', 'forecast solar day ahead',
       'forecast wind onshore day ahead', 'total load forecast',
       'total load actual', 'price day ahead', 'price actual', 'energy_loss'],
      dtype='object')
In [27]:
# Import the "seaborn" library for data visualization tools
import seaborn as sns

# The resulting heatmap provides a visual representation of the correlation relationships
sns.heatmap(final.corr());
<ipython-input-27-eb325e8a509d>:5: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

In [28]:
final.corr()
<ipython-input-28-4fee8eca12fa>:1: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

Out[28]:
temp temp_min temp_max pressure humidity wind_speed wind_deg rain_1h rain_3h snow_3h ... generation waste generation wind offshore generation wind onshore forecast solar day ahead forecast wind onshore day ahead total load forecast total load actual price day ahead price actual energy_loss
temp 1.000000 0.974541 0.966853 -0.008833 -0.573542 0.115307 -0.052199 -0.066632 -0.010022 -0.039008 ... 0.078189 NaN -0.125695 0.383305 -0.126883 0.179700 0.181200 0.061611 0.069932 -0.007603
temp_min 0.974541 1.000000 0.892425 -0.007505 -0.569617 0.113380 -0.041872 -0.071634 -0.003528 -0.035890 ... 0.086142 NaN -0.118171 0.375994 -0.119198 0.178020 0.179328 0.073644 0.080857 -0.005731
temp_max 0.966853 0.892425 1.000000 -0.009710 -0.534234 0.101714 -0.067548 -0.061496 -0.016446 -0.040011 ... 0.065756 NaN -0.128128 0.367203 -0.129384 0.168685 0.170405 0.042390 0.051325 -0.010385
pressure -0.008833 -0.007505 -0.009710 1.000000 -0.027458 0.001379 0.002265 0.039309 -0.000465 -0.000200 ... -0.012980 NaN 0.010336 -0.003001 0.010100 -0.000906 -0.000990 -0.009851 -0.007214 0.000832
humidity -0.573542 -0.569617 -0.534234 -0.027458 1.000000 -0.250336 -0.029316 0.134445 0.014036 0.023744 ... 0.002689 NaN -0.026042 -0.390739 -0.024132 -0.245748 -0.245296 -0.025828 -0.024741 -0.014864
wind_speed 0.115307 0.113380 0.101714 0.001379 -0.250336 1.000000 0.261888 0.052220 -0.019366 -0.006230 ... -0.048364 NaN 0.211037 0.137233 0.210601 0.125179 0.126286 -0.079933 -0.146129 -0.006039
wind_deg -0.052199 -0.041872 -0.067548 0.002265 -0.029316 0.261888 1.000000 0.039426 0.002445 -0.014599 ... -0.049592 NaN 0.094539 -0.051249 0.094577 -0.039849 -0.041705 -0.078951 -0.099958 0.015604
rain_1h -0.066632 -0.071634 -0.061496 0.039309 0.134445 0.052220 0.039426 1.000000 -0.009862 0.040347 ... -0.075450 NaN 0.064244 -0.013872 0.064044 0.011445 0.012259 -0.035598 -0.035814 -0.009319
rain_3h -0.010022 -0.003528 -0.016446 -0.000465 0.014036 -0.019366 0.002445 -0.009862 1.000000 -0.001063 ... -0.043109 NaN 0.000168 0.002119 0.000262 -0.002777 -0.003210 -0.014641 -0.009344 0.003499
snow_3h -0.039008 -0.035890 -0.040011 -0.000200 0.023744 -0.006230 -0.014599 0.040347 -0.001063 1.000000 ... -0.033426 NaN -0.000810 0.008593 -0.000858 -0.004551 -0.004486 -0.002330 0.006581 -0.001467
clouds_all -0.221331 -0.208759 -0.226416 0.004443 0.400483 0.051049 0.034008 0.229401 0.024327 0.044464 ... -0.036354 NaN 0.070695 -0.044241 0.070136 0.012446 0.013725 -0.016462 -0.052895 -0.012542
weather_id 0.157494 0.157292 0.149840 -0.004053 -0.290514 -0.042262 -0.030328 -0.461414 0.020114 -0.050192 ... 0.044629 NaN -0.075360 0.056795 -0.075355 -0.001231 -0.002647 0.015878 0.031498 0.014769
generation biomass 0.035562 0.026501 0.037321 0.006566 -0.022922 -0.022654 0.016182 0.025411 0.038221 0.014048 ... -0.346049 NaN -0.071454 -0.008640 -0.075154 0.084213 0.082126 0.108610 0.139684 0.026574
generation fossil brown coal/lignite 0.060444 0.057921 0.059616 -0.009407 0.009193 -0.096152 -0.072467 -0.044544 -0.002908 -0.006839 ... 0.281694 NaN -0.433778 0.042036 -0.435710 0.277840 0.279773 0.567427 0.362119 -0.006434
generation fossil coal-derived gas NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
generation fossil gas 0.098760 0.103899 0.085671 -0.007174 -0.067029 -0.058576 -0.073656 -0.035887 -0.016535 -0.009964 ... 0.273775 NaN -0.396052 0.079422 -0.395979 0.544385 0.549438 0.641082 0.461567 -0.029822
generation fossil hard coal 0.075921 0.070676 0.074213 -0.009351 -0.022854 -0.088650 -0.064558 -0.026681 0.009503 -0.002292 ... 0.169388 NaN -0.441006 0.046475 -0.443574 0.394580 0.396735 0.671350 0.463768 -0.005262
generation fossil oil 0.098213 0.095109 0.090992 -0.002862 -0.093322 -0.010796 -0.020351 0.004391 0.018493 0.001296 ... -0.175926 NaN -0.052026 0.096486 -0.058492 0.498399 0.496656 0.291538 0.283570 0.045628
generation fossil oil shale NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
generation fossil peat NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
generation geothermal NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
generation hydro pumped storage consumption -0.200733 -0.200461 -0.187952 0.008045 0.135608 0.029410 0.067193 0.007115 -0.001016 -0.006052 ... -0.187205 NaN 0.387939 -0.222139 0.389391 -0.559705 -0.562719 -0.600289 -0.425665 0.008499
generation hydro run-of-river and poundage -0.094031 -0.083967 -0.097980 0.006282 -0.015607 0.103535 0.053272 0.039117 0.003958 0.015146 ... -0.284832 NaN 0.223480 0.044964 0.226772 0.120677 0.118790 -0.294699 -0.136326 0.025318
generation hydro water reservoir -0.016122 -0.022264 -0.006460 0.009787 -0.059711 0.070773 0.011599 0.036395 0.010565 0.009965 ... -0.287344 NaN -0.018325 0.102313 -0.010482 0.476886 0.479831 -0.017583 0.072349 -0.009323
generation marine NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
generation nuclear -0.027666 -0.035792 -0.017446 0.009016 0.013120 0.013945 0.003441 0.021597 -0.003753 -0.005751 ... 0.087068 NaN 0.050326 0.000454 0.046849 0.090565 0.086333 -0.044226 -0.052200 0.051067
generation other -0.029561 -0.035290 -0.023901 0.010092 0.009448 -0.010507 0.019226 0.025908 0.029510 0.007493 ... -0.360662 NaN 0.045954 -0.019601 0.043062 0.101230 0.100589 0.044224 0.099534 0.013591
generation other renewable 0.000076 0.021759 -0.021381 -0.009222 -0.013323 -0.012568 -0.046320 -0.048473 -0.047161 -0.029162 ... 0.613788 NaN -0.135310 0.027202 -0.136993 0.178783 0.182773 0.429029 0.257654 -0.026187
generation solar 0.380767 0.373690 0.364353 -0.003049 -0.393232 0.136741 -0.049131 -0.014371 0.001884 0.010418 ... 0.000678 NaN -0.166908 0.993225 -0.172551 0.397345 0.394375 0.057769 0.097720 0.047448
generation waste 0.078189 0.086142 0.065756 -0.012980 0.002689 -0.048364 -0.049592 -0.075450 -0.043109 -0.033426 ... 1.000000 NaN -0.179539 0.000844 -0.183996 0.076476 0.078378 0.368187 0.170182 -0.010609
generation wind offshore NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
generation wind onshore -0.125695 -0.118171 -0.128128 0.010336 -0.026042 0.211037 0.094539 0.064244 0.000168 -0.000810 ... -0.179539 NaN 1.000000 -0.170121 0.994405 0.039633 0.042074 -0.422481 -0.218339 -0.021957
forecast solar day ahead 0.383305 0.375994 0.367203 -0.003001 -0.390739 0.137233 -0.051249 -0.013872 0.002119 0.008593 ... 0.000844 NaN -0.170121 1.000000 -0.174701 0.404597 0.402687 0.061788 0.100664 0.041717
forecast wind onshore day ahead -0.126883 -0.119198 -0.129384 0.010100 -0.024132 0.210601 0.094577 0.064044 0.000262 -0.000858 ... -0.183996 NaN 0.994405 -0.174701 1.000000 0.037186 0.039649 -0.426549 -0.219190 -0.020653
total load forecast 0.179700 0.178020 0.168685 -0.000906 -0.245748 0.125179 -0.039849 0.011445 -0.002777 -0.004551 ... 0.076476 NaN 0.039633 0.404597 0.037186 1.000000 0.995150 0.475440 0.435944 0.088711
total load actual 0.181200 0.179328 0.170405 -0.000990 -0.245296 0.126286 -0.041705 0.012259 -0.003210 -0.004486 ... 0.078378 NaN 0.042074 0.402687 0.039649 0.995150 1.000000 0.474668 0.436263 -0.009703
price day ahead 0.061611 0.073644 0.042390 -0.009851 -0.025828 -0.079933 -0.078951 -0.035598 -0.014641 -0.002330 ... 0.368187 NaN -0.422481 0.061788 -0.426549 0.475440 0.474668 1.000000 0.730636 0.022470
price actual 0.069932 0.080857 0.051325 -0.007214 -0.024741 -0.146129 -0.099958 -0.035814 -0.009344 0.006581 ... 0.170182 NaN -0.218339 0.100664 -0.219190 0.435944 0.436263 0.730636 1.000000 0.020216
energy_loss -0.007603 -0.005731 -0.010385 0.000832 -0.014864 -0.006039 0.015604 -0.009319 0.003499 -0.001467 ... -0.010609 NaN -0.021957 0.041717 -0.020653 0.088711 -0.009703 0.022470 0.020216 1.000000

39 rows × 39 columns

In [29]:
# Define a list of column names that have no significant correlation with other columns
columns_with_no_corr=["generation fossil oil shale","generation geothermal","generation wind offshore","generation marine","generation fossil coal-derived gas"]
In [30]:
final.drop(columns=columns_with_no_corr,axis=1,inplace=True)

final.drop(columns=["generation fossil peat"],axis=1,inplace=True)

import seaborn as sns
sns.heatmap(final.corr());
<ipython-input-30-cae2ab37ff4a>:6: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

In [31]:
#compute the correlation matrix for the DataFrame final and then retrieve the column names for which correlations were computed.

final.corr().columns
<ipython-input-31-71e34a2c95b7>:3: FutureWarning:

The default value of numeric_only in DataFrame.corr is deprecated. In a future version, it will default to False. Select only valid columns or specify the value of numeric_only to silence this warning.

Out[31]:
Index(['temp', 'temp_min', 'temp_max', 'pressure', 'humidity', 'wind_speed',
       'wind_deg', 'rain_1h', 'rain_3h', 'snow_3h', 'clouds_all', 'weather_id',
       'generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil gas', 'generation fossil hard coal',
       'generation fossil oil', 'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore',
       'forecast solar day ahead', 'forecast wind onshore day ahead',
       'total load forecast', 'total load actual', 'price day ahead',
       'price actual', 'energy_loss'],
      dtype='object')
In [32]:
#final["total load forecast"].hist()
sns.histplot(x=final["total load forecast"])
Out[32]:
<Axes: xlabel='total load forecast', ylabel='Count'>
In [33]:
# Import necessary libraries from scikit-learn

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.preprocessing import MinMaxScaler
In [34]:
# Import necessary libraries
from datetime import datetime
# Convert the 'dt_iso' column to datetime format
final["dt_iso"]= final["dt_iso"].apply(lambda x : datetime.fromisoformat(x.replace("Z", "+00:00")))

from statsmodels.tsa.seasonal import seasonal_decompose

# Perform seasonal decomposition for 'total load forecast' in the city of Valencia using additive model
result_add = seasonal_decompose(final[final["city_name"]=="Valencia"]["total load forecast"], model='additive', extrapolate_trend='freq', period=8760)

# Plot
plt.rcParams.update({'figure.figsize': (10,20)})
result_add.plot().suptitle('Additive Decomposition', fontsize=22)
plt.show()
In [35]:
final
Out[35]:
dt_iso city_name temp temp_min temp_max pressure humidity wind_speed wind_deg rain_1h ... generation solar generation waste generation wind onshore forecast solar day ahead forecast wind onshore day ahead total load forecast total load actual price day ahead price actual energy_loss
0 2015-01-01 00:00:00+01:00 Valencia 270.475000 270.475000 270.475000 1001 77 1 62 0.0 ... 49.0 196.0 6378.0 17.0 6436.0 26118.0 25385.0 50.10 65.41 733.0
1 2015-01-01 00:00:00+01:00 Madrid 267.325000 267.325000 267.325000 971 63 1 309 0.0 ... 49.0 196.0 6378.0 17.0 6436.0 26118.0 25385.0 50.10 65.41 733.0
2 2015-01-01 00:00:00+01:00 Bilbao 269.657312 269.657312 269.657312 1036 97 0 226 0.0 ... 49.0 196.0 6378.0 17.0 6436.0 26118.0 25385.0 50.10 65.41 733.0
3 2015-01-01 00:00:00+01:00 Barcelona 281.625000 281.625000 281.625000 1035 100 7 58 0.0 ... 49.0 196.0 6378.0 17.0 6436.0 26118.0 25385.0 50.10 65.41 733.0
4 2015-01-01 00:00:00+01:00 Seville 273.375000 273.375000 273.375000 1039 75 1 21 0.0 ... 49.0 196.0 6378.0 17.0 6436.0 26118.0 25385.0 50.10 65.41 733.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
178391 2018-12-31 23:00:00+01:00 Valencia 279.140000 278.150000 280.150000 1029 75 2 300 0.0 ... 31.0 287.0 3651.0 26.0 3117.0 24424.0 24455.0 64.27 69.88 -31.0
178392 2018-12-31 23:00:00+01:00 Madrid 275.150000 275.150000 275.150000 1031 74 1 360 0.0 ... 31.0 287.0 3651.0 26.0 3117.0 24424.0 24455.0 64.27 69.88 -31.0
178393 2018-12-31 23:00:00+01:00 Bilbao 275.600000 275.150000 276.150000 1034 93 2 100 0.0 ... 31.0 287.0 3651.0 26.0 3117.0 24424.0 24455.0 64.27 69.88 -31.0
178394 2018-12-31 23:00:00+01:00 Barcelona 280.130000 277.150000 283.150000 1028 100 5 310 0.0 ... 31.0 287.0 3651.0 26.0 3117.0 24424.0 24455.0 64.27 69.88 -31.0
178395 2018-12-31 23:00:00+01:00 Seville 283.970000 282.150000 285.150000 1029 70 3 50 0.0 ... 31.0 287.0 3651.0 26.0 3117.0 24424.0 24455.0 64.27 69.88 -31.0

178396 rows × 39 columns

In [36]:
weather_df
Out[36]:
dt_iso city_name temp temp_min temp_max pressure humidity wind_speed wind_deg rain_1h rain_3h snow_3h clouds_all weather_id weather_main weather_description weather_icon
0 2015-01-01 00:00:00+01:00 Valencia 270.475 270.475 270.475 1001 77 1 62 0.0 0.0 0.0 0 800 clear sky is clear 01n
1 2015-01-01 01:00:00+01:00 Valencia 270.475 270.475 270.475 1001 77 1 62 0.0 0.0 0.0 0 800 clear sky is clear 01n
2 2015-01-01 02:00:00+01:00 Valencia 269.686 269.686 269.686 1002 78 0 23 0.0 0.0 0.0 0 800 clear sky is clear 01n
3 2015-01-01 03:00:00+01:00 Valencia 269.686 269.686 269.686 1002 78 0 23 0.0 0.0 0.0 0 800 clear sky is clear 01n
4 2015-01-01 04:00:00+01:00 Valencia 269.686 269.686 269.686 1002 78 0 23 0.0 0.0 0.0 0 800 clear sky is clear 01n
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
178391 2018-12-31 19:00:00+01:00 Seville 287.760 287.150 288.150 1028 54 3 30 0.0 0.0 0.0 0 800 clear sky is clear 01n
178392 2018-12-31 20:00:00+01:00 Seville 285.760 285.150 286.150 1029 62 3 30 0.0 0.0 0.0 0 800 clear sky is clear 01n
178393 2018-12-31 21:00:00+01:00 Seville 285.150 285.150 285.150 1028 58 4 50 0.0 0.0 0.0 0 800 clear sky is clear 01n
178394 2018-12-31 22:00:00+01:00 Seville 284.150 284.150 284.150 1029 57 4 60 0.0 0.0 0.0 0 800 clear sky is clear 01n
178395 2018-12-31 23:00:00+01:00 Seville 283.970 282.150 285.150 1029 70 3 50 0.0 0.0 0.0 0 800 clear sky is clear 01n

178396 rows × 17 columns

In [37]:
# Calculate the rolling mean of the 'total load forecast' for the city of Valencia
# The rolling window size is set to 30 days (720 hours) to compute a moving average

energy_mean = final[final["city_name"]=="Valencia"]["total load forecast"].rolling(window=30*24).mean()

# Plot the rolling mean of energy load forecast for visualization
# The 'figsize' parameter adjusts the size of the plot
energy_mean.plot(figsize=(20,15))
Out[37]:
<Axes: >
In [38]:
# The 'energy_mean' DataFrame contains the rolling mean of the 'total load forecast' for the city of Valencia.

energy_mean
Out[38]:
0                  NaN
5                  NaN
10                 NaN
15                 NaN
20                 NaN
              ...     
178371    28783.105556
178376    28777.430556
178381    28773.012500
178386    28770.676389
178391    28769.665278
Name: total load forecast, Length: 35145, dtype: float64
In [39]:
# Define a function to calculate the maximum of the last 'n' values in a series excluding the current value at each index
def max_of_last_n_excluding_self(series, n):
    max_values = []

    for i in range(len(series)):
        if i >= n:
            max_value = max(series[i - n : i])  # Exclude the current value at index i
            max_values.append(max_value)
        else:
            max_values.append(None)  # Not enough data for the first n indices

    return max_values

# Create a DataFrame 'actual_df' containing the actual 'total load forecast' for the city of Valencia
actual_df = final[final["city_name"]=="Valencia"]["total load forecast"].to_frame().rename(columns = {"total load forecast": "total load forecast_actual" })
# Calculate the maximum of the last 24 hours' load forecasts (excluding the current hour) and add it as a new column
actual_df["total load forecast_pred"] = max_of_last_n_excluding_self(final[final["city_name"]=="Valencia"]["total load forecast"],24)
# Concatenate the actual and predicted temperature


# Select from the second row, because there is no prediction for today due to shifting.

actual_df.dropna(inplace=True)
In [40]:
actual_df
Out[40]:
total load forecast_actual total load forecast_pred
120 27309.0 30739.0
125 25397.0 30739.0
130 23640.0 30739.0
135 22638.0 30739.0
140 22238.0 30739.0
... ... ...
178371 30619.0 30378.0
178376 29932.0 30619.0
178381 27903.0 30619.0
178386 25450.0 30619.0
178391 24424.0 30619.0

35121 rows × 2 columns

In [41]:
# Create a DataFrame 'actual_df' containing the actual 'total load forecast' for the city of Valencia
actual_df = final[final["city_name"]=="Valencia"]["total load forecast"].to_frame().rename(columns = {"total load forecast": "total load forecast_actual" })

# Calculate the rolling mean of the 'total load forecast' over a window of 24 hours and add it as a new column
actual_df["total load forecast_pred"] = final[final["city_name"]=="Valencia"]["total load forecast"].rolling(window=24).mean()

# Drop rows with missing values (NaN) in the DataFrame
actual_df.dropna(inplace=True)
In [42]:
# Initialize a MinMaxScaler for feature scaling
scalar = MinMaxScaler()

# Apply MinMax scaling to the 'actual_df' DataFrame
scaled_onestep=scalar.fit_transform(actual_df)
In [43]:
actual_df
Out[43]:
total load forecast_actual total load forecast_pred
115 27589.0 24703.625000
120 27309.0 24753.250000
125 25397.0 24772.541667
130 23640.0 24777.750000
135 22638.0 24777.583333
... ... ...
178371 30619.0 26266.833333
178376 29932.0 26258.166667
178381 27903.0 26155.041667
178386 25450.0 25989.708333
178391 24424.0 25877.458333

35122 rows × 2 columns

In [44]:
scaled_onestep
Out[44]:
array([[0.40730084, 0.21785244],
       [0.39527593, 0.22127015],
       [0.31316298, 0.22259878],
       ...,
       [0.42078591, 0.31781255],
       [0.31543912, 0.30642593],
       [0.27137642, 0.29869519]])
In [45]:
# Import necessary libraries
from sklearn.metrics import mean_squared_error as MSE
from math import sqrt

# Calculate the Root Mean Squared Error (RMSE) between two sets of scaled values
# The first column contains the actual values, and the second column contains the predicted values
temp_pred_err = MSE(scaled_onestep[:,0],scaled_onestep[:,1],squared=True)

# Print the calculated RMSE
print("The RMSE is",temp_pred_err)
The RMSE is 0.0386542457438511
In [46]:
scaled_onestep
Out[46]:
array([[0.40730084, 0.21785244],
       [0.39527593, 0.22127015],
       [0.31316298, 0.22259878],
       ...,
       [0.42078591, 0.31781255],
       [0.31543912, 0.30642593],
       [0.27137642, 0.29869519]])
In [47]:
# Create a DataFrame 'energy_series' to store scaled energy predictions and actual loads
energy_series=pd.DataFrame(scaled_onestep[:,1],columns=["Energy_predictions"])
energy_series["actual_load"]=scaled_onestep[:,0]

# Calculate the rolling mean of the energy predictions and actual loads over a window of 30 days (720 hours)
energy_mean=energy_series.rolling(window=30*24).mean()

#plot the mean of energy
energy_mean.plot(figsize=(20,15))
Out[47]:
<Axes: >
In [48]:
scaled_onestep[:,0]
Out[48]:
array([0.40730084, 0.39527593, 0.31316298, ..., 0.42078591, 0.31543912,
       0.27137642])
In [49]:
scaled_onestep
Out[49]:
array([[0.40730084, 0.21785244],
       [0.39527593, 0.22127015],
       [0.31316298, 0.22259878],
       ...,
       [0.42078591, 0.31781255],
       [0.31543912, 0.30642593],
       [0.27137642, 0.29869519]])
In [50]:
import itertools

# Define the p, d and q parameters to take any value between 0 and 2
p = d = q = range(0, 2)

# Generate all different combinations of p, q and q triplets
pdq = list(itertools.product(p, d, q))

# Generate all different combinations of seasonal p, q and q triplets
seasonal_pdq = [(x[0], x[1], x[2], 12) for x in list(itertools.product(p, d, q))]
In [54]:
final.dtypes
Out[54]:
dt_iso                                          object
city_name                                       object
temp                                           float64
temp_min                                       float64
temp_max                                       float64
pressure                                         int64
humidity                                         int64
wind_speed                                       int64
wind_deg                                         int64
rain_1h                                        float64
rain_3h                                        float64
snow_3h                                        float64
clouds_all                                       int64
weather_id                                       int64
weather_main                                    object
weather_description                             object
weather_icon                                    object
time                                            object
generation biomass                             float64
generation fossil brown coal/lignite           float64
generation fossil gas                          float64
generation fossil hard coal                    float64
generation fossil oil                          float64
generation hydro pumped storage consumption    float64
generation hydro run-of-river and poundage     float64
generation hydro water reservoir               float64
generation nuclear                             float64
generation other                               float64
generation other renewable                     float64
generation solar                               float64
generation waste                               float64
generation wind onshore                        float64
forecast solar day ahead                       float64
forecast wind onshore day ahead                float64
total load forecast                            float64
total load actual                              float64
price day ahead                                float64
price actual                                   float64
energy_loss                                    float64
dtype: object
In [55]:
# Checking linear Regression
final_x=final[final["city_name"]=="Valencia"][['temp', 'temp_min', 'temp_max', 'pressure', 'humidity', 'wind_speed',
       'wind_deg', 'rain_1h', 'rain_3h', 'snow_3h', 'clouds_all', 'weather_id',
       'generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil gas', 'generation fossil hard coal',
       'generation fossil oil', 'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore',
       'forecast solar day ahead', 'forecast wind onshore day ahead',
       'total load forecast']].dropna()
scaler = MinMaxScaler()

# Fit the scaler on the data and transform it
X = final_x[['temp', 'temp_min', 'temp_max', 'pressure', 'humidity', 'wind_speed',
       'wind_deg', 'rain_1h', 'rain_3h', 'snow_3h', 'clouds_all', 'weather_id',
       'generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil gas', 'generation fossil hard coal',
       'generation fossil oil', 'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore',
       'forecast solar day ahead', 'forecast wind onshore day ahead']]
y = final_x[["total load forecast"]]

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
X_train=scaler.fit_transform(X_train)
X_test=scaler.fit_transform(X_test)
y_train=scaler.fit_transform(y_train)
y_test=scaler.fit_transform(y_test)
In [57]:
# Create Decision Tree Classifier object

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import plot_tree

# Create a Decision Tree Regressor object with a specified maximum depth
clf = DecisionTreeRegressor(max_depth=8)

# Train the Decision Tree Regressor on the training dataset
clf = clf.fit(X_train,y_train)

# Predict the response for test dataset
y_pred = clf.predict(X_test)

# Create a figure for plotting the decision tree
fig=plt.figure(figsize=(45,10))

# Plot the decision tree with filled nodes
plot_tree(clf,filled=True,feature_names=['temp', 'temp_min', 'temp_max', 'pressure', 'humidity', 'wind_speed',
       'wind_deg', 'rain_1h', 'rain_3h', 'snow_3h', 'clouds_all', 'weather_id',
       'generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil gas', 'generation fossil hard coal',
       'generation fossil oil', 'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore',
       'forecast solar day ahead', 'forecast wind onshore day ahead'],fontsize=20)

# Display the decision tree plot
plt.show()
In [58]:
# Import necessary libraries
from sklearn.metrics import mean_squared_error

# Calculate the Root Mean Squared Error (RMSE) between the true 'y_test' values and the predicted 'y_pred' values
# The 'squared=False' argument ensures that the RMSE is returned in its original scale
# Print the calculated RMSE for model evaluation

print("Model1 RMSE "+str(mean_squared_error(y_test,y_pred,squared=False)))
Model1 RMSE 0.11492390174981869
In [59]:
# Import necessary libraries
from sklearn.metrics import mean_absolute_error

# Print the calculated MAE for model evaluation
print("Model1 MAE "+str(mean_absolute_error(y_test,y_pred)))
Model1 MAE 0.088717941065767
In [60]:
# Import necessary libraries
import statsmodels.api as sm

# Add a constant column to the feature matrix 'X_train'
x = sm.add_constant(X_train)

# Fit an Ordinary Least Squares (OLS) regression model using 'x' as the feature matrix and 'y_train' as the target variable
model = sm.OLS(y_train, X_train).fit()

# Print the summary of the fitted OLS model, including statistical information and model performance metrics
print(model.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:                      y   R-squared (uncentered):                   0.980
Model:                            OLS   Adj. R-squared (uncentered):              0.980
Method:                 Least Squares   F-statistic:                          4.295e+04
Date:                Tue, 15 Aug 2023   Prob (F-statistic):                        0.00
Time:                        02:12:49   Log-Likelihood:                          30437.
No. Observations:               24585   AIC:                                 -6.082e+04
Df Residuals:                   24557   BIC:                                 -6.059e+04
Df Model:                          28                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            -0.0314      0.071     -0.440      0.660      -0.171       0.108
x2             0.0880      0.037      2.396      0.017       0.016       0.160
x3            -0.1169      0.040     -2.913      0.004      -0.196      -0.038
x4             0.0256      0.005      5.548      0.000       0.017       0.035
x5            -0.0772      0.003    -27.600      0.000      -0.083      -0.072
x6            -0.0345      0.014     -2.531      0.011      -0.061      -0.008
x7            -0.0226      0.001    -15.496      0.000      -0.025      -0.020
x8            -0.3243      0.025    -13.175      0.000      -0.373      -0.276
x9            -0.0208      0.013     -1.592      0.111      -0.047       0.005
x10           -0.2245      0.037     -6.059      0.000      -0.297      -0.152
x11           -0.0080      0.002     -4.003      0.000      -0.012      -0.004
x12           -0.1311      0.004    -34.079      0.000      -0.139      -0.124
x13           -0.1365      0.005    -27.899      0.000      -0.146      -0.127
x14            0.0336      0.002     15.619      0.000       0.029       0.038
x15            0.6034      0.006    102.250      0.000       0.592       0.615
x16            0.2417      0.004     64.225      0.000       0.234       0.249
x17            0.1260      0.005     24.176      0.000       0.116       0.136
x18           -0.3281      0.003   -100.019      0.000      -0.335      -0.322
x19           -0.0123      0.004     -3.266      0.001      -0.020      -0.005
x20            0.4187      0.004    110.561      0.000       0.411       0.426
x21            0.1156      0.004     31.148      0.000       0.108       0.123
x22            0.0063      0.003      1.933      0.053   -8.92e-05       0.013
x23           -0.0617      0.006    -10.636      0.000      -0.073      -0.050
x24            0.0508      0.013      3.772      0.000       0.024       0.077
x25            0.0253      0.005      5.369      0.000       0.016       0.035
x26            0.3834      0.023     16.423      0.000       0.338       0.429
x27            0.1576      0.014     11.593      0.000       0.131       0.184
x28            0.1182      0.023      5.084      0.000       0.073       0.164
==============================================================================
Omnibus:                      322.452   Durbin-Watson:                   1.980
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              591.712
Skew:                           0.027   Prob(JB):                    3.25e-129
Kurtosis:                       3.758   Cond. No.                         514.
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [ ]:
 
In [61]:
# Add a constant column to the feature matrix 'X_test'
x = sm.add_constant(X_test)

# Use the fitted OLS model 'model' to predict the target variable for the test dataset
predicted_list=model.predict(X_test)

# Display the predicted values for the test dataset
predicted_list
Out[61]:
array([0.5302908 , 0.60057654, 0.64597567, ..., 0.5666231 , 0.46907775,
       0.22274939])
In [62]:
from sklearn.metrics import mean_squared_error
# Calculate the Root Mean Squared Error (RMSE) between the true 'y_test' values and the predicted 'predicted_list' values
# The 'squared=False' argument ensures that the RMSE is returned in its original scale
# Print the calculated RMSE for model evaluation
print("Model1 RMSE "+str(mean_squared_error(y_test,predicted_list,squared=False)))
Model1 RMSE 0.08678651264112534
In [63]:
# Import necessary libraries
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# Create a Lasso regression model with a specified regularization parameter (alpha)
reg = linear_model.Lasso(alpha=0.1)
# Fit the Lasso regression model using the training data ('X_train' and 'y_train')
reg.fit(X_train,y_train)
# Calculate and print the R-squared score of the Lasso model on the training data
print(reg.score(X_train,y_train))
0.0
In [64]:
# Import necessary libraries
from sklearn.metrics import mean_squared_error
# Use the trained Lasso model 'reg' to predict the target variable for the test dataset
predicted_list=reg.predict(X_test)
# Print the calculated RMSE for model evaluation
print("Model1 RMSE "+str(mean_squared_error(y_test,predicted_list,squared=False)))
Model1 RMSE 0.19958037729911932
In [66]:
from sklearn.decomposition import PCA
import numpy as np  # Import NumPy



# Fit PCA and calculate explained variance
pca = PCA()
pca.fit(X_train)  # X is your data matrix
explained_variance_ratio = pca.explained_variance_ratio_



# Plot cumulative explained variance
import matplotlib.pyplot as plt
plt.figure(figsize=(8,6))
plt.plot(np.cumsum(explained_variance_ratio))
plt.xlabel('Number of Principal Components')
plt.ylabel('Cumulative Explained Variance')
plt.show()
In [67]:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Step 1: Normalize the data using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled=scaler.fit_transform(X_test)
# Step 2: Perform PCA
pca = PCA(n_components=15)  # Create a PCA object with 15 components
X_pca_train = pca.fit_transform(X_scaled) # Apply PCA to the normalized training data
X_pca_test = pca.fit_transform(X_test_scaled)  # Apply PCA to the normalized test data
#for 15 components 95% variance captured
In [69]:
X_pca_train
Out[69]:
array([[ 0.35664746, -1.62628323, -1.3136558 , ..., -1.34318139,
         0.84314037,  0.74094237],
       [-0.17754903, -1.99563622,  2.83850355, ...,  1.19729734,
         0.56352396,  0.1770781 ],
       [ 1.96399645,  4.17669541, -1.72121425, ..., -0.32360681,
         0.76786134, -0.60591091],
       ...,
       [-1.81625605,  3.98642503,  3.25749881, ..., -0.09425742,
        -0.65614435,  0.17474444],
       [ 0.49136231,  5.29209063,  0.17174245, ..., -0.66332313,
         0.83978913, -1.22883956],
       [ 1.35514104,  0.91199773, -0.55144859, ..., -0.99966362,
        -2.04312735,  0.31304693]])
In [70]:
X_pca_test
Out[70]:
array([[-0.67809883,  3.97624719, -2.06335347, ...,  0.47986341,
         0.05772155,  0.18039093],
       [-1.95231258,  0.43662293,  0.26922138, ...,  0.36639487,
         0.12314122, -0.28634634],
       [-1.15259208, -1.88037732, -0.62508409, ..., -1.26402646,
        -1.40102374, -0.76403093],
       ...,
       [ 1.64161227,  4.07773423,  2.79488827, ..., -0.94587568,
        -0.91457452, -0.8681216 ],
       [-0.32587695,  0.22216312, -2.23782486, ...,  1.21063252,
         0.83198182,  0.34105151],
       [ 3.81844739, -1.41876669,  0.51154829, ...,  0.4667914 ,
         0.21061727, -1.32840204]])
In [71]:
# Import necessary libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
# Create a RandomForestRegressor object with a specified maximum depth and random seed
regr = RandomForestRegressor(max_depth=2, random_state=0)
# Fit the RandomForestRegressor model using the transformed training data 'X_pca_train' and target values 'y_train'
regr.fit(X_pca_train, y_train)
Out[71]:
RandomForestRegressor(max_depth=2, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(max_depth=2, random_state=0)
In [72]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV, KFold
# Define a parameter grid for hyperparameter tuning
param_grid = {
    'ccp_alpha': [0.1, 1, 10],
    "max_depth":[2,4,6]
}

# Create a KFold cross-validation object with 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=41)
# Create a GridSearchCV object to perform hyperparameter tuning
grid_search = GridSearchCV(estimator=regr, param_grid=param_grid, cv=kf, scoring='neg_mean_squared_error')
# Fit the GridSearchCV on the transformed training data and target values
grid_search.fit(X_pca_train, y_train.reshape(-1))
# Retrieve the best parameters found during the grid search
best_params = grid_search.best_params_

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_pca_test, y_test.reshape(-1))

# Print the results of hyperparameter tuning and evaluation
print("Best Parameters:", best_params)
print("Test Accuracy:", test_accuracy)
Best Parameters: {'ccp_alpha': 0.1, 'max_depth': 2}
Test Accuracy: -0.0009079569577314928
In [73]:
import sklearn.metrics as met
# Retrieve the keys (names) of all available scoring metrics
met.SCORERS.keys()
Out[73]:
dict_keys(['explained_variance', 'r2', 'max_error', 'matthews_corrcoef', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'positive_likelihood_ratio', 'neg_negative_likelihood_ratio', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'jaccard', 'jaccard_macro', 'jaccard_micro', 'jaccard_samples', 'jaccard_weighted'])
In [74]:
y_train
Out[74]:
array([[0.23585393],
       [0.40285074],
       [0.29317027],
       ...,
       [0.49797606],
       [0.35685987],
       [0.4940143 ]])
In [75]:
# Retrieve the keys (names) of the hyperparameters of the RandomForestRegressor
regr.get_params().keys()
Out[75]:
dict_keys(['bootstrap', 'ccp_alpha', 'criterion', 'max_depth', 'max_features', 'max_leaf_nodes', 'max_samples', 'min_impurity_decrease', 'min_samples_leaf', 'min_samples_split', 'min_weight_fraction_leaf', 'n_estimators', 'n_jobs', 'oob_score', 'random_state', 'verbose', 'warm_start'])
In [76]:
# Create subplots with 1 row and 6 columns
fig, axes = plt.subplots(nrows = 1,ncols = 6,figsize = (10,2), dpi=900)
# Iterate through the first 6 estimators in the RandomForestRegressor
for index in range(0, 6):
      # Plot the decision tree for the current estimator
    tree.plot_tree(regr.estimators_[index],
                   feature_names=['temp', 'temp_min', 'temp_max', 'pressure', 'humidity', 'wind_speed',
       'wind_deg', 'rain_1h', 'rain_3h', 'snow_3h', 'clouds_all', 'weather_id',
       'generation biomass', 'generation fossil brown coal/lignite',
       'generation fossil gas', 'generation fossil hard coal',
       'generation fossil oil', 'generation hydro pumped storage consumption',
       'generation hydro run-of-river and poundage',
       'generation hydro water reservoir', 'generation nuclear',
       'generation other', 'generation other renewable', 'generation solar',
       'generation waste', 'generation wind onshore',
       'forecast solar day ahead', 'forecast wind onshore day ahead'],
                   filled = True,
                   ax = axes[index]);
        # Set the title for the current subplot
    axes[index].set_title('Estimator: ' + str(index), fontsize = 11)
# Save the figure as an image file
fig.savefig('rf_5trees.png')
In [77]:
X_pca_test
Out[77]:
array([[-0.67809883,  3.97624719, -2.06335347, ...,  0.47986341,
         0.05772155,  0.18039093],
       [-1.95231258,  0.43662293,  0.26922138, ...,  0.36639487,
         0.12314122, -0.28634634],
       [-1.15259208, -1.88037732, -0.62508409, ..., -1.26402646,
        -1.40102374, -0.76403093],
       ...,
       [ 1.64161227,  4.07773423,  2.79488827, ..., -0.94587568,
        -0.91457452, -0.8681216 ],
       [-0.32587695,  0.22216312, -2.23782486, ...,  1.21063252,
         0.83198182,  0.34105151],
       [ 3.81844739, -1.41876669,  0.51154829, ...,  0.4667914 ,
         0.21061727, -1.32840204]])
In [78]:
from sklearn.metrics import mean_squared_error
y_pred_RF=regr.predict(X_pca_test)
print("Model1 MSE "+str(mean_squared_error(y_test,y_pred_RF)))
Model1 MSE 0.026724733755228974
In [79]:
from sklearn.metrics import mean_absolute_error
# Predict using the RandomForestRegressor
y_pred_RF=regr.predict(X_pca_test)
# Calculate and print the Mean Absolute Error (MAE)
print("Model1 MAE "+str(mean_absolute_error(y_test,y_pred_RF)))
Model1 MAE 0.13437720295643518
In [80]:
from sklearn.metrics import mean_absolute_percentage_error
# Predict using the RandomForestRegressor
y_pred_RF=regr.predict(X_pca_test)
# Calculate and print the Mean Absolute Percentage Error (MAPE)
print("Model1 MAPE "+str(mean_absolute_percentage_error(y_test,y_pred_RF)))
Model1 MAPE 134949500586.28116
In [81]:
# Predict using the RandomForestRegressor on the training data
y_pred_RF_train=regr.predict(X_pca_train)
In [82]:
# Create a DataFrame to store the predicted energy values
energy_series=pd.DataFrame(y_pred_RF,columns=["Energy_predictions"])
In [83]:
# Add a column to the energy_series DataFrame for actual energy values
energy_series["energy_actual"]=y_test
In [84]:
# Calculate the rolling mean of the predicted energy values
energy_mean = energy_series.rolling(window=30*24).mean()
# Create a plot of the rolling mean of predicted energy values
energy_mean.plot(figsize=(20,15))
# random forest
Out[84]:
<Axes: >
In [85]:
X_test_reshaped=X_pca_test.reshape(10537, 1, 15)
In [86]:
X_train.shape
Out[86]:
(24585, 28)
In [87]:
X_train_reshaped = X_pca_train.reshape(24585, 1, 15)
In [88]:
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


# Build an RNN model
model = Sequential()
model.add(SimpleRNN(units=32, activation='relu', input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])))
model.add(Dense(units=1))

# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(X_train_reshaped, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model
y_pred_RNN = model.predict(X_test_reshaped)
mse = mean_squared_error(y_test, y_pred_RNN)
print("Mean Squared Error:", mse)
Epoch 1/10
615/615 [==============================] - 3s 3ms/step - loss: 0.0805 - val_loss: 0.0230
Epoch 2/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0174 - val_loss: 0.0140
Epoch 3/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0121 - val_loss: 0.0108
Epoch 4/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0098 - val_loss: 0.0089
Epoch 5/10
615/615 [==============================] - 3s 4ms/step - loss: 0.0084 - val_loss: 0.0079
Epoch 6/10
615/615 [==============================] - 2s 4ms/step - loss: 0.0073 - val_loss: 0.0070
Epoch 7/10
615/615 [==============================] - 2s 4ms/step - loss: 0.0073 - val_loss: 0.0106
Epoch 8/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0116 - val_loss: 0.0253
Epoch 9/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0120 - val_loss: 0.0058
Epoch 10/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0058 - val_loss: 0.0058
330/330 [==============================] - 1s 3ms/step
Mean Squared Error: 0.013080175496853379
In [89]:
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 simple_rnn (SimpleRNN)      (None, 32)                1536      
                                                                 
 dense (Dense)               (None, 1)                 33        
                                                                 
=================================================================
Total params: 1,569
Trainable params: 1,569
Non-trainable params: 0
_________________________________________________________________
In [90]:
# Predict using the RNN model on the reshaped test data
y_pred_RNN = model.predict(X_test_reshaped)
# Calculate the Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_RNN)
# Print the calculated Mean Squared Error
print("Mean Squared Error:", mse)
330/330 [==============================] - 1s 3ms/step
Mean Squared Error: 0.013080175496853379
In [91]:
# Predict using the RNN model on the reshaped test data
from sklearn.metrics import mean_absolute_error
# Calculate the Mean Absolute Error (MAE)
y_pred_RNN =model.predict(X_test_reshaped)
# Print the calculated Mean Absolute Error
print("Model1 MSE "+str(mean_absolute_error(y_test,y_pred_RNN )))
330/330 [==============================] - 1s 2ms/step
Model1 MSE 0.07403818270914564
In [92]:
# Predict using the RNN model on the reshaped test data
from sklearn.metrics import mean_absolute_percentage_error
# Calculate the Mean Absolute Percentage Error (MAPE)
y_pred_RNN =model.predict(X_test_reshaped)
# Print the calculated Mean Absolute Percentage Error
print("Model1 MAPE "+str(mean_absolute_percentage_error(y_test,y_pred_RNN)))
330/330 [==============================] - 1s 2ms/step
Model1 MAPE 58265804734.14871
In [93]:
y_pred_RNN_train=model.predict(X_train_reshaped)
769/769 [==============================] - 1s 2ms/step
In [94]:
# Create a DataFrame for storing RNN predictions
energy_series=pd.DataFrame(y_pred_RNN,columns=["Energy_predictions"])
# Add actual energy values to the DataFrame
energy_series["energy_actual"]=y_test
# Calculate rolling mean of energy predictions
energy_mean = energy_series.rolling(window=30*24).mean()
# Create a line plot of the rolling mean of energy predictions
energy_mean.plot(figsize=(20,15))
# RNN
Out[94]:
<Axes: >
In [95]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

# Create a sequential model
model = Sequential()

# Add an LSTM layer
model.add(LSTM(units=64, return_sequences=True, input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])))
model.add(Dense(units=1))
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model
model.fit(X_train_reshaped, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Evaluate the model
y_pred_LSTM = model.predict(X_test_reshaped)
mse = mean_squared_error(y_test, y_pred_LSTM.reshape(10537))
print("Mean Squared Error:", mse)
Epoch 1/10
615/615 [==============================] - 6s 5ms/step - loss: 0.0164 - val_loss: 0.0064
Epoch 2/10
615/615 [==============================] - 3s 4ms/step - loss: 0.0059 - val_loss: 0.0056
Epoch 3/10
615/615 [==============================] - 3s 4ms/step - loss: 0.0054 - val_loss: 0.0053
Epoch 4/10
615/615 [==============================] - 4s 6ms/step - loss: 0.0052 - val_loss: 0.0050
Epoch 5/10
615/615 [==============================] - 3s 4ms/step - loss: 0.0049 - val_loss: 0.0048
Epoch 6/10
615/615 [==============================] - 3s 4ms/step - loss: 0.0047 - val_loss: 0.0047
Epoch 7/10
615/615 [==============================] - 2s 4ms/step - loss: 0.0046 - val_loss: 0.0048
Epoch 8/10
615/615 [==============================] - 3s 5ms/step - loss: 0.0045 - val_loss: 0.0044
Epoch 9/10
615/615 [==============================] - 3s 5ms/step - loss: 0.0044 - val_loss: 0.0045
Epoch 10/10
615/615 [==============================] - 3s 4ms/step - loss: 0.0043 - val_loss: 0.0045
330/330 [==============================] - 1s 2ms/step
Mean Squared Error: 0.011152442743992654
In [96]:
model.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 lstm (LSTM)                 (None, 1, 64)             20480     
                                                                 
 dense_1 (Dense)             (None, 1, 1)              65        
                                                                 
=================================================================
Total params: 20,545
Trainable params: 20,545
Non-trainable params: 0
_________________________________________________________________
In [97]:
y_pred_LSTM_train=model.predict(X_train_reshaped)
769/769 [==============================] - 2s 2ms/step
In [98]:
# Create a DataFrame for storing LSTM predictions
energy_series=pd.DataFrame(y_pred_LSTM.reshape(10537),columns=["Energy_predictions"])
# Add actual energy values to the DataFrame
energy_series["energy_actual"]=y_test
# Calculate rolling mean of energy predictions
energy_mean = energy_series.rolling(window=30*24).mean()
# Create a line plot of the rolling mean of energy predictions
energy_mean.plot(figsize=(10,6))
# LSTM
Out[98]:
<Axes: >
In [99]:
# Create a sequential model
model = Sequential()

# Add input layer with 64 units and ReLU activation function
# Add hidden layer with 32 units and ReLU activation function
model.add(Dense(units=64, activation='relu', input_shape=(X_train_reshaped.shape[1], X_train_reshaped.shape[2])))
model.add(Dense(units=32, activation='relu'))

# Add output layer with 1 unit (for regression)
model.add(Dense(units=1))

# Compile the model with Adam optimizer and mean squared error loss
model.compile(optimizer='adam', loss='mean_squared_error')

# Train the model with training data for 10 epochs and batch size of 32
model.fit(X_train_reshaped, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Predict energy consumption using the trained model
y_pred_MLP = model.predict(X_test_reshaped)
# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred_MLP.reshape(10537))
# Print the mean squared error
print("Mean Squared Error:", mse)
Epoch 1/10
615/615 [==============================] - 3s 3ms/step - loss: 0.0427 - val_loss: 0.0155
Epoch 2/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0106 - val_loss: 0.0076
Epoch 3/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0068 - val_loss: 0.0063
Epoch 4/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0059 - val_loss: 0.0067
Epoch 5/10
615/615 [==============================] - 2s 4ms/step - loss: 0.0077 - val_loss: 0.0497
Epoch 6/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0163 - val_loss: 0.0060
Epoch 7/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0050 - val_loss: 0.0051
Epoch 8/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0047 - val_loss: 0.0051
Epoch 9/10
615/615 [==============================] - 1s 2ms/step - loss: 0.0046 - val_loss: 0.0046
Epoch 10/10
615/615 [==============================] - 2s 3ms/step - loss: 0.0046 - val_loss: 0.0047
330/330 [==============================] - 1s 2ms/step
Mean Squared Error: 0.006490054271212503
In [100]:
model.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_2 (Dense)             (None, 1, 64)             1024      
                                                                 
 dense_3 (Dense)             (None, 1, 32)             2080      
                                                                 
 dense_4 (Dense)             (None, 1, 1)              33        
                                                                 
=================================================================
Total params: 3,137
Trainable params: 3,137
Non-trainable params: 0
_________________________________________________________________
In [101]:
# Predict energy consumption using the trained MLP model on the training data
y_pred_MLP_train=model.predict(X_train_reshaped)
769/769 [==============================] - 1s 1ms/step
In [102]:
(X_train_reshaped.shape[1], X_train_reshaped.shape[2])
Out[102]:
(1, 15)
In [103]:
#Create a DataFrame named energy_series with predicted energy values reshaped from y_pred_MLP
# Reshaping is done to make sure it matches the shape of y_test
energy_series=pd.DataFrame(y_pred_MLP.reshape(10537),columns=["Energy_predictions"])

# Add a new column to energy_series containing actual energy values from y_test
energy_series["energy_actual"]=y_test

# Calculate the rolling mean of the energy_series over a window of 30 days (30*24 hours)
energy_mean = energy_series.rolling(window=30*24).mean()


# Set the size of the plot to 20x15 inches
# Then, plot the rolling mean of energy_series using a line plot
energy_mean.plot(figsize=(20,15))
# MLP regression
Out[103]:
<Axes: >

Ensemble Model¶

In [140]:
predictions_df
Out[140]:
Random Forest RNN MLP LSTM CNN_LSTM Actual load
0 0.507204 0.212257 0.230389 0.188384 0.204368 0.235854
1 0.315739 0.418718 0.325366 0.316008 0.366726 0.402851
2 0.418931 0.287089 0.348004 0.308110 0.361589 0.293170
3 0.487849 0.605787 0.598816 0.562299 0.595291 0.615580
4 0.418931 0.271261 0.294836 0.285386 0.282516 0.375894
... ... ... ... ... ... ...
24580 0.317454 0.364139 0.298728 0.282859 0.369462 0.376023
24581 0.502812 0.562295 0.534996 0.509357 0.542074 0.517699
24582 0.460907 0.468982 0.495694 0.556511 0.589115 0.497976
24583 0.418931 0.410248 0.322879 0.331517 0.393089 0.356860
24584 0.492767 0.459145 0.454469 0.485176 0.521329 0.494014

24585 rows × 6 columns

In [151]:
# Import the statsmodels library with an alias 'sm'
import statsmodels.api as sm

# Create a new DataFrame named 'x' by adding a constant column to the predictions_df
# The constant column is necessary for the intercept term in the linear regression model
x = sm.add_constant(predictions_df[["RNN","MLP","LSTM"]])

# Create a linear regression model using Ordinary Least Squares (OLS)
model = sm.OLS(predictions_df[["Actual load"]],predictions_df[["RNN","MLP","LSTM"]]).fit()

# Print a summary of the linear regression model's statistics
print(model.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:            Actual load   R-squared (uncentered):                   0.984
Model:                            OLS   Adj. R-squared (uncentered):              0.984
Method:                 Least Squares   F-statistic:                          5.011e+05
Date:                Tue, 15 Aug 2023   Prob (F-statistic):                        0.00
Time:                        02:46:07   Log-Likelihood:                          33118.
No. Observations:               24585   AIC:                                 -6.623e+04
Df Residuals:                   24582   BIC:                                 -6.621e+04
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
RNN            0.0743      0.009      8.362      0.000       0.057       0.092
MLP            0.3641      0.012     29.843      0.000       0.340       0.388
LSTM           0.5774      0.013     42.807      0.000       0.551       0.604
==============================================================================
Omnibus:                      406.959   Durbin-Watson:                   1.989
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              703.738
Skew:                           0.131   Prob(JB):                    1.53e-153
Kurtosis:                       3.786   Cond. No.                         36.1
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [152]:
predictions_df
Out[152]:
Random Forest RNN MLP LSTM CNN_LSTM Actual load
0 0.507204 0.212257 0.230389 0.188384 0.204368 0.235854
1 0.315739 0.418718 0.325366 0.316008 0.366726 0.402851
2 0.418931 0.287089 0.348004 0.308110 0.361589 0.293170
3 0.487849 0.605787 0.598816 0.562299 0.595291 0.615580
4 0.418931 0.271261 0.294836 0.285386 0.282516 0.375894
... ... ... ... ... ... ...
24580 0.317454 0.364139 0.298728 0.282859 0.369462 0.376023
24581 0.502812 0.562295 0.534996 0.509357 0.542074 0.517699
24582 0.460907 0.468982 0.495694 0.556511 0.589115 0.497976
24583 0.418931 0.410248 0.322879 0.331517 0.393089 0.356860
24584 0.492767 0.459145 0.454469 0.485176 0.521329 0.494014

24585 rows × 6 columns

Voting Ensembling

In [155]:
from keras.models import Sequential
from keras.layers import Conv1D, MaxPooling1D, LSTM, Dense
import numpy as np

# Create a Sequential model
model = Sequential()
# Add a 1D convolutional layer with ReLU activation
model.add(Conv1D(filters=32, kernel_size=(1), activation='relu', input_shape=(1,15)))
# Add a MaxPooling1D layer
model.add(MaxPooling1D(pool_size=1))
# Add an LSTM layer with 64 units and return sequences
model.add(LSTM(units=64, return_sequences=True))

# Add Dense layer for final prediction
model.add(Dense(units=1, activation='relu'))

# Compile the model using Adam optimizer and mean squared error loss
model.compile(optimizer='adam', loss='mean_squared_error')

# Print model summary
model.summary()
# Fit the model to the training data
model.fit(X_pca_train.reshape(24585,1,15), y_train, epochs=10, batch_size=32)
Model: "sequential_4"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv1d_1 (Conv1D)           (None, 1, 32)             512       
                                                                 
 max_pooling1d_1 (MaxPooling  (None, 1, 32)            0         
 1D)                                                             
                                                                 
 lstm_2 (LSTM)               (None, 1, 64)             24832     
                                                                 
 dense_6 (Dense)             (None, 1, 1)              65        
                                                                 
=================================================================
Total params: 25,409
Trainable params: 25,409
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
769/769 [==============================] - 7s 5ms/step - loss: 0.0102
Epoch 2/10
769/769 [==============================] - 4s 5ms/step - loss: 0.0056
Epoch 3/10
769/769 [==============================] - 4s 6ms/step - loss: 0.0051
Epoch 4/10
769/769 [==============================] - 4s 5ms/step - loss: 0.0049
Epoch 5/10
769/769 [==============================] - 3s 4ms/step - loss: 0.0047
Epoch 6/10
769/769 [==============================] - 4s 6ms/step - loss: 0.0046
Epoch 7/10
769/769 [==============================] - 4s 5ms/step - loss: 0.0044
Epoch 8/10
769/769 [==============================] - 3s 4ms/step - loss: 0.0043
Epoch 9/10
769/769 [==============================] - 4s 5ms/step - loss: 0.0042
Epoch 10/10
769/769 [==============================] - 4s 6ms/step - loss: 0.0042
Out[155]:
<keras.callbacks.History at 0x7d47c17205b0>
In [156]:
# Use the trained model to make predictions on the test data

y_pred_CNN_LSTM=model.predict(X_test_reshaped)
330/330 [==============================] - 1s 2ms/step
In [157]:
# Use the trained model to make predictions on the reshaped training data

x=model.predict(X_train_reshaped)
769/769 [==============================] - 2s 2ms/step
In [158]:
# Reshape the predictions from the model into a 1D array

x.reshape(-1)
Out[158]:
array([0.21149546, 0.36228505, 0.3161803 , ..., 0.54253733, 0.3554708 ,
       0.48823586], dtype=float32)
In [159]:
# The variable X_test_reshaped contains the reshaped test data

X_test_reshaped
Out[159]:
array([[[-0.67809883,  3.97624719, -2.06335347, ...,  0.47986341,
          0.05772155,  0.18039093]],

       [[-1.95231258,  0.43662293,  0.26922138, ...,  0.36639487,
          0.12314122, -0.28634634]],

       [[-1.15259208, -1.88037732, -0.62508409, ..., -1.26402646,
         -1.40102374, -0.76403093]],

       ...,

       [[ 1.64161227,  4.07773423,  2.79488827, ..., -0.94587568,
         -0.91457452, -0.8681216 ]],

       [[-0.32587695,  0.22216312, -2.23782486, ...,  1.21063252,
          0.83198182,  0.34105151]],

       [[ 3.81844739, -1.41876669,  0.51154829, ...,  0.4667914 ,
          0.21061727, -1.32840204]]])
In [160]:
X_test_reshaped
Out[160]:
array([[[-0.67809883,  3.97624719, -2.06335347, ...,  0.47986341,
          0.05772155,  0.18039093]],

       [[-1.95231258,  0.43662293,  0.26922138, ...,  0.36639487,
          0.12314122, -0.28634634]],

       [[-1.15259208, -1.88037732, -0.62508409, ..., -1.26402646,
         -1.40102374, -0.76403093]],

       ...,

       [[ 1.64161227,  4.07773423,  2.79488827, ..., -0.94587568,
         -0.91457452, -0.8681216 ]],

       [[-0.32587695,  0.22216312, -2.23782486, ...,  1.21063252,
          0.83198182,  0.34105151]],

       [[ 3.81844739, -1.41876669,  0.51154829, ...,  0.4667914 ,
          0.21061727, -1.32840204]]])
In [161]:
# Create a DataFrame named energy_series with predicted energy values from the CNN LSTM model
energy_series=pd.DataFrame(y_pred_CNN_LSTM.reshape(-1),columns=["Energy_predictions"])
# Add a new column to energy_series containing actual energy values from y_test
energy_series["energy_actual"]=y_test
# Calculate the rolling mean of energy_series over a window of 30 days (30*24 hours)
energy_mean = energy_series.rolling(window=30*24).mean()

# Set the size of the plot to 20x10 inches
# Then, plot the rolling mean of energy_series using a line plot
energy_mean.plot(figsize=(20,10))
# CNN LSTM hybrid model
Out[161]:
<Axes: >
In [162]:
# Calculate the mean squared error (MSE) between the actual energy values (y_test) and predictions from the CNN LSTM model
mse = mean_squared_error(y_test, y_pred_CNN_LSTM.reshape(-1))
# Print the calculated mean squared error
print("Mean Squared Error:", mse)
Mean Squared Error: 0.006008444888229008
In [163]:
# Create a DataFrame named energy_series with energy predictions from y_train
energy_series=pd.DataFrame(y_train,columns=["Energy_predictions"])
In [164]:
# Calculate the rolling mean of energy predictions from the DataFrame energy_series
y=energy_series["Energy_predictions"].rolling(window=30*24).mean()[:1024]
In [165]:
y
Out[165]:
0            NaN
1            NaN
2            NaN
3            NaN
4            NaN
          ...   
1019    0.453184
1020    0.452730
1021    0.452862
1022    0.452641
1023    0.452862
Name: Energy_predictions, Length: 1024, dtype: float64
In [166]:
energy_series
Out[166]:
Energy_predictions
0 0.235854
1 0.402851
2 0.293170
3 0.615580
4 0.375894
... ...
24580 0.376023
24581 0.517699
24582 0.497976
24583 0.356860
24584 0.494014

24585 rows × 1 columns

In [167]:
len(X_test_reshaped)
Out[167]:
10537
In [168]:
y_train
Out[168]:
array([[0.23585393],
       [0.40285074],
       [0.29317027],
       ...,
       [0.49797606],
       [0.35685987],
       [0.4940143 ]])
In [169]:
# Create a DataFrame named predictions_df containing predictions from different models and actual load values

predictions_df=pd.DataFrame(data=y_pred_RF_train,columns=["Random Forest"])
predictions_df["RNN"]=y_pred_RNN_train.reshape(24585)
predictions_df["MLP"]=y_pred_MLP_train.reshape(24585)
predictions_df["LSTM"]=y_pred_LSTM_train.reshape(24585)
predictions_df["CNN_LSTM"]=x.reshape(24585)
predictions_df["Actual load"]=y_train
In [170]:
predictions_df
Out[170]:
Random Forest RNN MLP LSTM CNN_LSTM Actual load
0 0.507204 0.212257 0.230389 0.188384 0.211495 0.235854
1 0.315739 0.418718 0.325366 0.316008 0.362285 0.402851
2 0.418931 0.287089 0.348004 0.308110 0.316180 0.293170
3 0.487849 0.605787 0.598816 0.562299 0.597373 0.615580
4 0.418931 0.271261 0.294836 0.285386 0.250318 0.375894
... ... ... ... ... ... ...
24580 0.317454 0.364139 0.298728 0.282859 0.331058 0.376023
24581 0.502812 0.562295 0.534996 0.509357 0.541966 0.517699
24582 0.460907 0.468982 0.495694 0.556511 0.542537 0.497976
24583 0.418931 0.410248 0.322879 0.331517 0.355471 0.356860
24584 0.492767 0.459145 0.454469 0.485176 0.488236 0.494014

24585 rows × 6 columns

In [171]:
# Create a DataFrame named predictions_df_test containing predictions from different models, actual load values, and y_actual
predictions_df_test=pd.DataFrame(data=y_pred_RF,columns=["Random Forest"])
predictions_df_test["RNN"]=y_pred_RNN.reshape(10537)
predictions_df_test["MLP"]=y_pred_MLP.reshape(10537)
predictions_df_test["LSTM"]=y_pred_LSTM.reshape(10537)
predictions_df_test["CNN_LSTM"]=y_pred_CNN_LSTM.reshape(10537)
predictions_df_test["y_actual"]=y_test
In [172]:
predictions_df_test
Out[172]:
Random Forest RNN MLP LSTM CNN_LSTM y_actual
0 0.481352 0.443365 0.574335 0.560276 0.583084 0.570911
1 0.517339 0.624323 0.547839 0.664262 0.560572 0.567597
2 0.516851 0.673315 0.677402 0.682022 0.654718 0.605277
3 0.553069 0.691995 0.653069 0.627103 0.666148 0.600262
4 0.317454 0.273155 0.278417 0.222353 0.266471 0.162102
... ... ... ... ... ... ...
10532 0.442921 0.565241 0.482390 0.611821 0.521983 0.579067
10533 0.315739 0.249783 0.141433 0.176199 0.165419 0.190493
10534 0.488384 0.524229 0.520902 0.569639 0.572822 0.579546
10535 0.515490 0.536594 0.581097 0.332666 0.570930 0.444570
10536 0.317454 0.264106 0.225027 0.247447 0.272645 0.306454

10537 rows × 6 columns

ENSEMBLED STACK

In [173]:
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.linear_model import LinearRegression
# Define the parameter grid for hyperparameter search
param_grid = {
    'n_jobs': [1,2,3,4,5,6,7,8,9,10]
}
# Create a Linear Regression model
model = LinearRegression()
# Initialize KFold cross-validation with 5 splits
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Perform GridSearchCV with negative mean squared error as scoring metric
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=kf, scoring='neg_mean_squared_error')
grid_search.fit(predictions_df[["RNN","MLP","LSTM","Random Forest","CNN_LSTM"]], predictions_df[["Actual load"]])
# Get the best model from the grid search
best_params = grid_search.best_params_

# Evaluate the best model on the test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(predictions_df_test[["RNN","MLP","LSTM","Random Forest","CNN_LSTM"]], predictions_df_test[["y_actual"]])

# Print the best parameters and test accuracy
print("Best Parameters:", best_params)
print("Test Accuracy:", test_accuracy)
Best Parameters: {'n_jobs': 1}
Test Accuracy: 0.8491999432211498
In [174]:
# Use the best_model to make predictions on the test set using the selected ensemble of models
predictions_df_test["Meta_model_prediction"]=best_model.predict(predictions_df_test[["RNN","MLP","LSTM","Random Forest","CNN_LSTM"]])
In [176]:
predictions_df_test
Out[176]:
Random Forest RNN MLP LSTM CNN_LSTM y_actual Meta_model_prediction
0 0.481352 0.443365 0.574335 0.560276 0.583084 0.570911 0.582810
1 0.517339 0.624323 0.547839 0.664262 0.560572 0.567597 0.581195
2 0.516851 0.673315 0.677402 0.682022 0.654718 0.605277 0.665526
3 0.553069 0.691995 0.653069 0.627103 0.666148 0.600262 0.649373
4 0.317454 0.273155 0.278417 0.222353 0.266471 0.162102 0.265362
... ... ... ... ... ... ... ...
10532 0.442921 0.565241 0.482390 0.611821 0.521983 0.579067 0.536351
10533 0.315739 0.249783 0.141433 0.176199 0.165419 0.190493 0.166835
10534 0.488384 0.524229 0.520902 0.569639 0.572822 0.579546 0.562654
10535 0.515490 0.536594 0.581097 0.332666 0.570930 0.444570 0.512188
10536 0.317454 0.264106 0.225027 0.247447 0.272645 0.306454 0.263070

10537 rows × 7 columns

In [177]:
# Calculate the rolling mean of the columns "y_actual" and "Meta_model_prediction" from predictions_df_test
energy_mean = predictions_df_test[["y_actual","Meta_model_prediction"]].rolling(window=30*24).mean()


# Set the size of the plot to 20x15 inches
# Then, plot the rolling mean using a line plot
energy_mean.plot(figsize=(20,15))
# Linear regression ensemble meta model
Out[177]:
<Axes: >
In [178]:
# Calculate the mean squared error (MSE) between the actual energy values (y_actual) and ensemble model predictions (Meta_model_prediction)
mse = mean_squared_error(predictions_df_test["y_actual"], predictions_df_test["Meta_model_prediction"])
# Print the calculated mean squared error
print("Mean Squared Error:", mse)
Mean Squared Error: 0.006001151615983355
In [180]:
best_model.coef_
Out[180]:
array([[-0.05035801,  0.23785579,  0.26249903, -0.03414418,  0.54491015]])
In [181]:
import statsmodels.api as sm
# Add a constant term to the independent variables
x = sm.add_constant(predictions_df[["RNN","MLP","LSTM","Random Forest","CNN_LSTM"]])
# Fit an Ordinary Least Squares (OLS) model
model = sm.OLS(predictions_df[["Actual load"]],predictions_df[["RNN","MLP","LSTM","Random Forest","CNN_LSTM"]]).fit()
# Print the summary of the OLS model
print(model.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:            Actual load   R-squared (uncentered):                   0.985
Model:                            OLS   Adj. R-squared (uncentered):              0.985
Method:                 Least Squares   F-statistic:                          3.173e+05
Date:                Tue, 15 Aug 2023   Prob (F-statistic):                        0.00
Time:                        02:55:23   Log-Likelihood:                          33770.
No. Observations:               24585   AIC:                                 -6.753e+04
Df Residuals:                   24580   BIC:                                 -6.749e+04
Df Model:                           5                                                  
Covariance Type:            nonrobust                                                  
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
RNN              -0.0376      0.009     -4.023      0.000      -0.056      -0.019
MLP               0.2186      0.013     17.444      0.000       0.194       0.243
LSTM              0.2637      0.016     16.736      0.000       0.233       0.295
Random Forest     0.0148      0.003      5.140      0.000       0.009       0.020
CNN_LSTM          0.5441      0.015     36.322      0.000       0.515       0.573
==============================================================================
Omnibus:                      460.593   Durbin-Watson:                   1.986
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              819.988
Skew:                           0.144   Prob(JB):                    8.75e-179
Kurtosis:                       3.847   Cond. No.                         53.0
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [182]:
def max_of_last_n(series, n):
    max_values = []

    for i in range(len(series)):
        if i >= n:
            max_value = max(series[i - n + 1 : i + 1])
            max_values.append(max_value)
        else:
            max_values.append(None)  # Not enough data for the first n indices

    return max_values

# Example usage
data_series = [5, 8, 3, 12, 6, 9, 15, 7, 10, 20]
n = 3
result = max_of_last_n(data_series, n)
print(result)
[None, None, None, 12, 12, 12, 15, 15, 15, 20]

VOTING

In [183]:
# Calculate the ensemble prediction by averaging predictions from different models
predictions_df_test["model_predictions_voting"]=(predictions_df_test["Random Forest"]+
predictions_df_test["RNN"]+predictions_df_test["MLP"]+predictions_df_test["LSTM"]+predictions_df_test["CNN_LSTM"])/5
In [184]:
predictions_df_test
Out[184]:
Random Forest RNN MLP LSTM CNN_LSTM y_actual Meta_model_prediction model_predictions_voting
0 0.481352 0.443365 0.574335 0.560276 0.583084 0.570911 0.582810 0.528482
1 0.517339 0.624323 0.547839 0.664262 0.560572 0.567597 0.581195 0.582867
2 0.516851 0.673315 0.677402 0.682022 0.654718 0.605277 0.665526 0.640862
3 0.553069 0.691995 0.653069 0.627103 0.666148 0.600262 0.649373 0.638277
4 0.317454 0.273155 0.278417 0.222353 0.266471 0.162102 0.265362 0.271570
... ... ... ... ... ... ... ... ...
10532 0.442921 0.565241 0.482390 0.611821 0.521983 0.579067 0.536351 0.524871
10533 0.315739 0.249783 0.141433 0.176199 0.165419 0.190493 0.166835 0.209715
10534 0.488384 0.524229 0.520902 0.569639 0.572822 0.579546 0.562654 0.535195
10535 0.515490 0.536594 0.581097 0.332666 0.570930 0.444570 0.512188 0.507355
10536 0.317454 0.264106 0.225027 0.247447 0.272645 0.306454 0.263070 0.265336

10537 rows × 8 columns

In [185]:
# Calculate the rolling mean of the columns "y_actual," "Meta_model_prediction," and "model_predictions_voting" from predictions_df_test
energy_mean = predictions_df_test[["y_actual","Meta_model_prediction","model_predictions_voting"]].rolling(window=30*24).mean()

# Set the size of the plot to 20x15 inches
# Then, plot the rolling mean using a line plot
energy_mean.plot(figsize=(20,15))
Out[185]:
<Axes: >
In [186]:
from sklearn.metrics import mean_absolute_error
print("Model1 MSE "+str(mean_absolute_error(predictions_df_test["y_actual"],predictions_df_test["Meta_model_prediction"])))
Model1 MSE 0.06060945614065426
In [187]:
!jupyter nbconvert --to html datamining_final_project_ensemble_Kfold_validation_scalecorrelection(2).ipynb
/bin/bash: -c: line 1: syntax error near unexpected token `('
/bin/bash: -c: line 1: `jupyter nbconvert --to html datamining_final_project_ensemble_Kfold_validation_scalecorrelection(2).ipynb'
In [188]:
!jupyter nbconvert --to html datamining_final_project_ensemble_Kfold_validation_scalecorrelection v.ipynb
[NbConvertApp] WARNING | pattern 'datamining_final_project_ensemble_Kfold_validation_scalecorrelection' matched no files
[NbConvertApp] WARNING | pattern 'v.ipynb' matched no files
This application is used to convert notebook files (*.ipynb)
        to various other formats.

        WARNING: THE COMMANDLINE INTERFACE MAY CHANGE IN FUTURE RELEASES.

Options
=======
The options below are convenience aliases to configurable class-options,
as listed in the "Equivalent to" description-line of the aliases.
To see all configurable class-options for some <cmd>, use:
    <cmd> --help-all

--debug
    set log level to logging.DEBUG (maximize logging output)
    Equivalent to: [--Application.log_level=10]
--show-config
    Show the application's configuration (human-readable format)
    Equivalent to: [--Application.show_config=True]
--show-config-json
    Show the application's configuration (json format)
    Equivalent to: [--Application.show_config_json=True]
--generate-config
    generate default config file
    Equivalent to: [--JupyterApp.generate_config=True]
-y
    Answer yes to any questions instead of prompting.
    Equivalent to: [--JupyterApp.answer_yes=True]
--execute
    Execute the notebook prior to export.
    Equivalent to: [--ExecutePreprocessor.enabled=True]
--allow-errors
    Continue notebook execution even if one of the cells throws an error and include the error message in the cell output (the default behaviour is to abort conversion). This flag is only relevant if '--execute' was specified, too.
    Equivalent to: [--ExecutePreprocessor.allow_errors=True]
--stdin
    read a single notebook file from stdin. Write the resulting notebook with default basename 'notebook.*'
    Equivalent to: [--NbConvertApp.from_stdin=True]
--stdout
    Write notebook output to stdout instead of files.
    Equivalent to: [--NbConvertApp.writer_class=StdoutWriter]
--inplace
    Run nbconvert in place, overwriting the existing notebook (only
            relevant when converting to notebook format)
    Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory=]
--clear-output
    Clear output of current file and save in place,
            overwriting the existing notebook.
    Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory= --ClearOutputPreprocessor.enabled=True]
--no-prompt
    Exclude input and output prompts from converted document.
    Equivalent to: [--TemplateExporter.exclude_input_prompt=True --TemplateExporter.exclude_output_prompt=True]
--no-input
    Exclude input cells and output prompts from converted document.
            This mode is ideal for generating code-free reports.
    Equivalent to: [--TemplateExporter.exclude_output_prompt=True --TemplateExporter.exclude_input=True --TemplateExporter.exclude_input_prompt=True]
--allow-chromium-download
    Whether to allow downloading chromium if no suitable version is found on the system.
    Equivalent to: [--WebPDFExporter.allow_chromium_download=True]
--disable-chromium-sandbox
    Disable chromium security sandbox when converting to PDF..
    Equivalent to: [--WebPDFExporter.disable_sandbox=True]
--show-input
    Shows code input. This flag is only useful for dejavu users.
    Equivalent to: [--TemplateExporter.exclude_input=False]
--embed-images
    Embed the images as base64 dataurls in the output. This flag is only useful for the HTML/WebPDF/Slides exports.
    Equivalent to: [--HTMLExporter.embed_images=True]
--sanitize-html
    Whether the HTML in Markdown cells and cell outputs should be sanitized..
    Equivalent to: [--HTMLExporter.sanitize_html=True]
--log-level=<Enum>
    Set the log level by value or name.
    Choices: any of [0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL']
    Default: 30
    Equivalent to: [--Application.log_level]
--config=<Unicode>
    Full path of a config file.
    Default: ''
    Equivalent to: [--JupyterApp.config_file]
--to=<Unicode>
    The export format to be used, either one of the built-in formats
            ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'rst', 'script', 'slides', 'webpdf']
            or a dotted object name that represents the import path for an
            ``Exporter`` class
    Default: ''
    Equivalent to: [--NbConvertApp.export_format]
--template=<Unicode>
    Name of the template to use
    Default: ''
    Equivalent to: [--TemplateExporter.template_name]
--template-file=<Unicode>
    Name of the template file to use
    Default: None
    Equivalent to: [--TemplateExporter.template_file]
--theme=<Unicode>
    Template specific theme(e.g. the name of a JupyterLab CSS theme distributed
    as prebuilt extension for the lab template)
    Default: 'light'
    Equivalent to: [--HTMLExporter.theme]
--sanitize_html=<Bool>
    Whether the HTML in Markdown cells and cell outputs should be sanitized.This
    should be set to True by nbviewer or similar tools.
    Default: False
    Equivalent to: [--HTMLExporter.sanitize_html]
--writer=<DottedObjectName>
    Writer class used to write the
                                        results of the conversion
    Default: 'FilesWriter'
    Equivalent to: [--NbConvertApp.writer_class]
--post=<DottedOrNone>
    PostProcessor class used to write the
                                        results of the conversion
    Default: ''
    Equivalent to: [--NbConvertApp.postprocessor_class]
--output=<Unicode>
    overwrite base name use for output files.
                can only be used when converting one notebook at a time.
    Default: ''
    Equivalent to: [--NbConvertApp.output_base]
--output-dir=<Unicode>
    Directory to write output(s) to. Defaults
                                  to output to the directory of each notebook. To recover
                                  previous default behaviour (outputting to the current
                                  working directory) use . as the flag value.
    Default: ''
    Equivalent to: [--FilesWriter.build_directory]
--reveal-prefix=<Unicode>
    The URL prefix for reveal.js (version 3.x).
            This defaults to the reveal CDN, but can be any url pointing to a copy
            of reveal.js.
            For speaker notes to work, this must be a relative path to a local
            copy of reveal.js: e.g., "reveal.js".
            If a relative path is given, it must be a subdirectory of the
            current directory (from which the server is run).
            See the usage documentation
            (https://nbconvert.readthedocs.io/en/latest/usage.html#reveal-js-html-slideshow)
            for more details.
    Default: ''
    Equivalent to: [--SlidesExporter.reveal_url_prefix]
--nbformat=<Enum>
    The nbformat version to write.
            Use this to downgrade notebooks.
    Choices: any of [1, 2, 3, 4]
    Default: 4
    Equivalent to: [--NotebookExporter.nbformat_version]

Examples
--------

    The simplest way to use nbconvert is

            > jupyter nbconvert mynotebook.ipynb --to html

            Options include ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'rst', 'script', 'slides', 'webpdf'].

            > jupyter nbconvert --to latex mynotebook.ipynb

            Both HTML and LaTeX support multiple output templates. LaTeX includes
            'base', 'article' and 'report'.  HTML includes 'basic', 'lab' and
            'classic'. You can specify the flavor of the format used.

            > jupyter nbconvert --to html --template lab mynotebook.ipynb

            You can also pipe the output to stdout, rather than a file

            > jupyter nbconvert mynotebook.ipynb --stdout

            PDF is generated via latex

            > jupyter nbconvert mynotebook.ipynb --to pdf

            You can get (and serve) a Reveal.js-powered slideshow

            > jupyter nbconvert myslides.ipynb --to slides --post serve

            Multiple notebooks can be given at the command line in a couple of
            different ways:

            > jupyter nbconvert notebook*.ipynb
            > jupyter nbconvert notebook1.ipynb notebook2.ipynb

            or you can specify the notebooks list in a config file, containing::

                c.NbConvertApp.notebooks = ["my_notebook.ipynb"]

            > jupyter nbconvert --config mycfg.py

To see all available configurables, use `--help-all`.